I'm not dead!
'Ere, he says he's not dead.
Yes he is.
I'm not.
He isn't.
Well, he will be soon, he's very ill.
I'm getting better!
- Monty Python

These tutorials are a simplified introduction, and are not sufficient on their own to achieve system safety. You are responsible for the safety of your system.

© 2020 Philip Koopman

■ Anti-Patterns for Redundancy:
  • Unsafe because double-spending redundancy
  • No between-mission redundancy diagnostics
  • Low test coverage on redundant components

■ Redundant components help reliability
  • But what happens when a component breaks?
    - Need to gracefully curtail current mission
    - Prohibit additional missions until repaired
  • Reliability assumes perfection at mission start
    - Untested redundancy undermines reliability

[Figure 1. Postaccident aerial view of Whatcom Creek showing fire damage.]
Bellingham WA, June 1999: Gasoline spill & fire kills 3 due to improper management of SCADA redundancy

Response To A Component Failure

■ Use of Redundancy: Availability
  • Hot Standby takes over upon failure
  • Assumes somehow you detect failure
    - For low criticality systems, perhaps it's OK to miss some failures; have human trigger failover

[Figure: Hot Standby Pattern. Primary system fails over to hot standby upon fault.]
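
The hot standby idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Unit` and `HotStandbyPair` names are hypothetical, and the `healthy` flag stands in for whatever failure detection the real system has (the slide notes that detection is assumed, not provided, by this pattern).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Unit:
    """Hypothetical redundant unit; `healthy` stands in for a real fault detector."""
    name: str
    healthy: bool = True


class HotStandbyPair:
    """Illustrative primary plus hot standby; not a production design."""

    def __init__(self, primary: Unit, standby: Unit):
        self.active: Unit = primary
        self.standby: Optional[Unit] = standby
        self.degraded = False  # True once the spare has been used up

    def step(self) -> str:
        """Return the name of the unit serving this cycle, failing over if needed."""
        if not self.active.healthy:
            if self.standby is None or not self.standby.healthy:
                raise RuntimeError("no healthy unit available")
            # Fail over: the standby becomes active and no spare remains.
            self.active, self.standby = self.standby, None
            self.degraded = True
        return self.active.name

    def may_start_new_mission(self) -> bool:
        # Redundancy is spent after a failover, so block new missions until repair.
        return not self.degraded
```

After a failover the pair keeps serving the current mission, but `may_start_new_mission()` returns False, which mirrors the "prohibit additional missions until repaired" point.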





■ Even if only one component breaks at a time...
  • Single computer can fail "active" (dangerous)
  • Self-test cannot find all faults
  • Single component is unsafe for SIL 3,4

■ Use of Redundancy: Fault Detection
  • 2-of-2 used for fault detection

[Figure: 2-of-2 Fail Silent Pattern. Two channels with a cross-check form a fail-silent system component.]
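
A minimal sketch of a 2-of-2 cross-check, assuming each channel is a plain function of one numeric input. The function name `two_of_two` and the `tolerance` parameter are illustrative assumptions, not part of the slides. Returning `None` models "fail silent": no output is produced when the channels disagree.

```python
from typing import Callable, Optional


def two_of_two(channel_a: Callable[[float], float],
               channel_b: Callable[[float], float],
               x: float,
               tolerance: float = 0.0) -> Optional[float]:
    """Return the agreed output, or None (silence) if the channels disagree."""
    a = channel_a(x)
    b = channel_b(x)
    # Written as "not <=" so that a NaN from either channel counts as a mismatch.
    if not abs(a - b) <= tolerance:
        return None  # fail silent: emit nothing rather than a possibly wrong value
    return a
```

Note that this arrangement only detects a fault; with one channel lost there is nothing left to keep operating, which is the "double-spending" concern.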



Fail Operational Approaches

■ Can't double-spend redundancy!
  • Need 2 components to detect a failure
  • PLUS more components to operate after failure

■ Triplex modular redundancy (2-of-3)
  • Three copies of subsystem and vote
  • But... voter can be single point of failure!

[Figure: 2-of-3 Majority Voter Pattern. A voter produces a fail-operational output.]
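
As a rough illustration of 2-of-3 voting, the sketch below takes three channel outputs and returns the majority value. The name `vote_2_of_3` is an assumption for this example, and exact-match voting on hashable values is a simplification; real voters often need tolerance bands for analog values.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def vote_2_of_3(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value at least two channels agree on, or None if none do."""
    if len(outputs) != 3:
        raise ValueError("expected exactly three channel outputs")
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= 2 else None
```

One faulty channel is outvoted, so the output stays available. The voter function itself is still a single piece of logic, which is the single point of failure the slide warns about.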

■ Dual 2-of-2
  • Two copies of subsystem for availability
  • Each subsystem is 2-of-2 to provide fault detection

[Figure: Dual 2-of-2 Pattern. Each subsystem is a cross-checked pair; fail-over upon fault to a hot standby (fail-silent).]
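
The dual 2-of-2 arrangement can be sketched by composing two cross-checked pairs. All names here (`pair_output`, `dual_two_of_two`) are hypothetical, and the sketch is stateless for brevity; a real system would likely latch the failover rather than re-trying the primary pair every cycle.

```python
from typing import Callable, Optional, Tuple

Channel = Callable[[int], int]
Pair = Tuple[Channel, Channel]


def pair_output(pair: Pair, x: int) -> Optional[int]:
    """2-of-2 within one subsystem: output only if both channels agree."""
    a, b = pair[0](x), pair[1](x)
    return a if a == b else None


def dual_two_of_two(primary: Pair, standby: Pair, x: int) -> Optional[int]:
    """Use the primary pair unless it goes silent, then use the standby pair."""
    out = pair_output(primary, x)
    if out is not None:
        return out
    # Primary pair detected a fault and went silent; the standby pair takes over.
    return pair_output(standby, x)
```

Each pair spends its redundancy on detection, and the second pair is the separate redundancy spent on availability.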

Doer/Checker & Redundancy

■ Hybrid of Low SIL Doer and High SIL 2-of-2 checker
  • Single Low SIL primary
    - Provides normal functionality
  • 2-of-2 High SIL checker
    - Shuts down if primary unsafe
    - Shuts down if cross-check fails

[Figure: Mixed-SIL Doer/Checker Pattern. High SIL Checker #1 and High SIL Checker #2, joined by a cross-check, form a 2-of-2 checker (fail-silent).]

■ Common building blocks:
  • 2-of-2 for fault detection
  • Doer/Checker for fault isolation
  • Hot standby for fail operational
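
The doer/checker hybrid described above might look roughly like this. The function and parameter names are assumptions for illustration: the doer computes a command, and two independent checkers each judge whether that command is safe. Returning `None` models shutdown.

```python
from typing import Callable, Optional

Checker = Callable[[float, float], bool]  # (input, command) -> is the command safe?


def doer_checker(doer: Callable[[float], float],
                 checker_1: Checker,
                 checker_2: Checker,
                 x: float) -> Optional[float]:
    """Pass the doer's command through only if both checkers accept it."""
    command = doer(x)
    ok_1 = checker_1(x, command)
    ok_2 = checker_2(x, command)
    if ok_1 != ok_2:
        return None  # cross-check failed: the checkers disagree, so shut down
    if not ok_1:
        return None  # both checkers judge the primary's command unsafe
    return command
```

A checker that wrongly approves everything goes unnoticed as long as the doer behaves, which is one way a latent fault can hide in this structure.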

Diagnostic Effectiveness

■ Reliability math assumes all redundancy working
  • On-line diagnostics: self-test at start of mission
    - Example: IEC 60730 self-test library
  • Off-line diagnostics: "Proof test"
    - Example: exercise an elevator safety limit switch

■ Latent undetected faults
  • Undetectable faults lead to coincident failures
    - 2-of-2 doesn't work if both fail the same way!
  • Run-time detection: frequent health cross-checks
    - Scrub state, e.g., compare RAM values
    - Swap active units periodically to self-test
  • Off-line detection: enforce periodic proof tests
    - Self-test or require diagnostic to resume operation
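
One way to picture "enforce periodic proof tests" is a gate that refuses operation when the last passed test is too old. This is a sketch under assumptions: `ProofTestGate`, its method names, and the use of caller-supplied timestamps in seconds are all hypothetical choices made for this example.

```python
from typing import Optional


class ProofTestGate:
    """Blocks operation when the last passed proof test is missing or too old."""

    def __init__(self, max_interval_s: float):
        self.max_interval_s = max_interval_s
        self.last_pass_s: Optional[float] = None  # no test has passed yet

    def record_result(self, now_s: float, passed: bool) -> None:
        # A failed test invalidates any earlier pass until a new test passes.
        self.last_pass_s = now_s if passed else None

    def may_operate(self, now_s: float) -> bool:
        if self.last_pass_s is None:
            return False
        return (now_s - self.last_pass_s) <= self.max_interval_s
```

Passing timestamps in explicitly keeps the sketch deterministic; a real system would need a trustworthy time source, which is itself something to diagnose.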

Best Practices For Redundancy Management

■ What happens when component fails?
  • Some redundancy is for fault detection
  • Other redundancy is for availability
  • Plan how to detect & survive failures

■ Diagnostic coverage matters
  • Pre-mission test; cross-checks; proof tests
  • Minimize potential for latent faults

■ Pitfalls:
  • Don't double-spend your redundancy (detect & failover are different)
  • Look for common-mode failures (e.g., software updates)

[Figure: Safety Instrumented Function (SIF) Failure at an Undisclosed Plant]